Skip to content

VReplication: Improve Error/Status Reporting#12052

Merged
mattlord merged 11 commits intovitessio:mainfrom
planetscale:vrepl_progress_improvements
Jan 12, 2023
Merged

VReplication: Improve Error/Status Reporting#12052
mattlord merged 11 commits intovitessio:mainfrom
planetscale:vrepl_progress_improvements

Conversation

@mattlord
Copy link
Copy Markdown
Member

@mattlord mattlord commented Jan 6, 2023

Description

This PR addresses the issues raised in the bug report along with some other related improvements that I saw were needed as I worked on the new troubleshooting guide:

  1. Adds unit tests for the Progress (which is a parent of Show) action
  2. When the stream is in an error state, show the error message in the Show/Progress action output. For example:
    $ vtctlclient MoveTables -- Progress customer.commerce2customer
    The following vreplication streams exist for workflow customer.commerce2customer:
    
    id=1 on 0/zone1-0000000201: Status: Error: Duplicate entry '6' for key 'customer.PRIMARY' (errno 1062) (sqlstate 23000) during query: insert into customer(customer_id,email) values (6,'mlord@planetscale.com').
    
  3. Adds additional errors to the unRecoverableError() error list as things like "Unknown column 'notes' in 'field list' (errno 1054) (sqlstate 42S22) during query: insert into customer(customer_id,email,notes) values (100,'test@tester.com','Lots of notes')" were not covered AND keeping in mind that users can now modify the on-ddl value and DDL may come through the stream
Click here to see new and improved outputs from the test case in the bug report:
$ vtctlclient MoveTables -- --tablet_types=RDONLY --source commerce --tables 'customer,corder' Create customer.commerce2customer
Waiting for workflow to start:
0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ... 0% ...
The workflow did not start within 30s. The workflow may simply be slow to start or there may be an issue.
Check the status using the 'Workflow customer.commerce2customer show' client command for details.
MoveTables Error: rpc error: code = Unknown desc = timed out waiting for workflow to start
E0106 15:09:23.724578   99838 main.go:105] remote error: rpc error: code = Unknown desc = timed out waiting for workflow to start
$ vtctlclient Workflow -- customer.commerce2customer show
{
	"Workflow": "commerce2customer",
	"SourceLocation": {
		"Keyspace": "commerce",
		"Shards": [
			"0"
		]
	},
	"TargetLocation": {
		"Keyspace": "customer",
		"Shards": [
			"0"
		]
	},
	"MaxVReplicationLag": 51,
	"MaxVReplicationTransactionLag": 1673036368,
	"Frozen": false,
	"ShardStatuses": {
		"0/zone1-0000000200": {
			"PrimaryReplicationStatuses": [
				{
					"Shard": "0",
					"Tablet": "zone1-0000000200",
					"ID": 1,
					"Bls": {
						"keyspace": "commerce",
						"shard": "0",
						"filter": {
							"rules": [
								{
									"match": "customer",
									"filter": "select * from customer"
								},
								{
									"match": "corder",
									"filter": "select * from corder"
								}
							]
						}
					},
					"Pos": "",
					"StopPos": "",
					"State": "Error",
					"DBName": "vt_customer",
					"TransactionTimestamp": 0,
					"TimeUpdated": 1673036317,
					"TimeHeartbeat": 0,
					"TimeThrottled": 0,
					"ComponentThrottled": "",
					"Message": "Error picking tablet: context has expired",
					"Tags": "",
					"WorkflowType": "MoveTables",
					"WorkflowSubType": "None",
					"CopyState": null
				}
			],
			"TabletControls": null,
			"PrimaryIsServing": true
		}
	},
	"SourceTimeZone": "",
	"TargetTimeZone": ""
}
# When we're trying to start, but have not yet produced an error or made any progress
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream has not started.


# When we've encountered an error
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000201: Status: Error: Error picking tablet: context has expired.


# When the copy phase is actually in progress
$ vtctlclient MoveTables -- Progress customer.commerce2customer
Copy Progress (approx):

corder: rows copied 0/5 (0%), size copied 16384/16384 (100%)
customer: rows copied 0/5 (0%), size copied 16384/16384 (100%)


The following vreplication streams exist for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Copying. VStream Lag: 0s.


# When the copy phase is actually completed
$ vtctlclient MoveTables -- Progress customer.commerce2customer
The following vreplication streams exist for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream Lag: 0s.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added
  • Documentation was added in the User Guides update

Signed-off-by: Matt Lord <mattalord@gmail.com>
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot bot commented Jan 6, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 482a586 to ec67dcc Compare January 6, 2023 21:13
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 29ccc27 to 6186b01 Compare January 7, 2023 18:53
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 6186b01 to da59866 Compare January 7, 2023 19:24
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 30cf297 to fc0895c Compare January 8, 2023 00:58
@mattlord mattlord marked this pull request as ready for review January 8, 2023 02:12
@mattlord mattlord removed the request for review from notfelineit January 8, 2023 02:13
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 8de0fb5 to 6b623c5 Compare January 8, 2023 02:56
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vrepl_progress_improvements branch from 6b623c5 to c487d1c Compare January 8, 2023 02:59
Also keeping mind that users can now modify the on_ddl
value so that DDL comes through the stream.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord requested review from shlomi-noach and removed request for ajm188, deepthi and systay January 11, 2023 04:48
ERServerShutdown = 1053

// not found
ERDbDropExists = 1008
Copy link
Copy Markdown
Member Author

@mattlord mattlord Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this file were just me ordering the subgroups by error code as it makes it easier to walk through all of them as I did to see what ones should be in the vreplication.unRecoverableError() list.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Copy Markdown
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple minor comments at your discretion

case <-timedCtx.Done():
wr.Logger().Printf("\nThe workflow did not start within %s. The workflow may simply be slow to start or there may be an issue.\n",
(*timeout).String())
wr.Logger().Printf("Check the status using the 'Workflow %s show' client command for details.\n", ksWorkflow)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it possible to check the status at this point and print the output? Save the user an extra step?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the output is already a little noisy and at this point (at least sub-second before) it had not yet started. I also like that we're informing/reminding them of the way to monitor the status and see the workflow details.

@shlomi-noach
Copy link
Copy Markdown
Contributor

While ticking the review template, I noted

If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

Could you please do that for the new TestMoveTables? (noting to myself to similarly document my tests)

Copy link
Copy Markdown
Member

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit 0f4cd2d into vitessio:main Jan 12, 2023
@mattlord mattlord deleted the vrepl_progress_improvements branch January 12, 2023 15:49
mattlord added a commit to vitessio/website that referenced this pull request Sep 20, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added a commit to vitessio/website that referenced this pull request Sep 22, 2023
Signed-off-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: VReplication Create/Progress Cmds Misleading/Incorrect

3 participants